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Abstract —Here we present an implementation of Primal-Dual 
Affine scaling method to solve linear optimisation problem on 
GPU based systems. Strategies to convert the system generated 
by complementary slackness theorem into a symmetric system 
are given. A new CUBA friendly technique to solve the resulting 
symmetric positive definite subsystem is also developed. Various 
strategies to reduce the memory transfer and storage require¬ 
ments were also explored. 
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1. Introduction 

A Linear Programming Problem concerns with maximising 
or minimising an objective function subjected to certain con¬ 
straints. Canonical form of a linear programming problem can 
be written as follows. 

minimize c^x 

Ax = b (1) 

X > 0 

The form given in Q is called primal problem and its 
corresponding dual problem is expressed as 

maximize y 

A^y + 5 = c (2) 

5 > 0 

A 

X y G is set of basic variables 

cgM^, beR^ 
s gM^ is the slack variable 

In equation Q and Q, A is assumed to have full rank and 
m < n. A solution to the primal problem is a vector x* G 
such that e^x* is minimised and the constraints Ax* = b, 
X* > 0 are each satisfied. 

II. Solving Linear Programming Problems 

The constraints of a linear optimisation problems forms an 
n dimensional polyotope and the objective function is an n- 
1 diminutional plane in the same space. The minimum of 
objective function is has its minimum at one of the vertices 
of this constraint polyotope. This fact was proved by Chong 
m. most of the linear programming methods are based on 
searching for this vertex from an initial point in the polyotope. 

According to the way in which the searching progresses, 
the methods to solve linear programming problems can be 


broadly divided into Boundary methods and Interior point 
methods. Boundary methods always stays in the boundary of 
the polyotope and search for the optimum vertex where the 
objective function is minimum. 

Simplex method, the most common boundary point methods 
start with an initial feasible solution and move towards adja¬ 
cent vertices, constantly reducing value of objective function 
in the process till it cannot be reduced further. A vivid account 
of how the method works and the theory behind it can be found 
in Chong (Tl and Nazareth (71 

Interior point methods start by selecting a solution inside 
the polyotope and iteratively advance the point towards the 
optimum node. The fastest methods pick initial point in the 
neighbourhood of center of the polyotope and follow a path 
central from all nodes towards the optimum nodes. Detail 
account of this can be found in Saigal © 


Xopt 



Polytope of a two-dimensional feasible region. The x^ points 
are determined by an interior point method, while the x • 
ones by a simplex method. 

III. The Algorithm 

By complementary slackness theorem in Dantzig and Orden 
m , the following system of non-linear is generated for the 
primal and Dual problems described in equations Q and Q. 

Ax = b 

A^y-i- s = c 

Xs = 0 

On applying newtons method to this system we can find the 
optimum direction to which the next step is to be taken. This 



is given by 


'A O O' 


Ax 


0 

0 I 


Ay 

= 

0 

SOX 


As 




(3) 


Where S and X are diagonal matrices with diagonal entries 
of s and x respectively. 

The primal dual affine scaling algorithm from Saigal O 
can be outlined as follows: 


Algorithm Primal-Dual Affine Scaling 

Let be an initial interior solution 

0 < a < 1 

k:=0 

While TRUE do 

Solve system to obtain optimal 
direction vectors (Ax^, As^) 

If Ax^ = 0, Ay^ = 0 and As^ = 0 then 
is Optimum. STOP 

End If 

= max {^(-A^lAx^), 0(-5'^^A5^)} 

(p(u)=max{uj} for any vector u 

If = 1 then 

a '.= 1 

End If 

^/c+l _ ^/c ^Ax^ 

yk+1 =yk 

gk-\-l _ gk ^/\gk 

If = 1 then 

{x^,y^,s^) is Optimum .STOP 

End If 
k=k+l 
End While 

end 


The most important part of this algorithm is in solving the 
system 0- As we can see from the algorithm itself, this 
system has to be solved in every iteration. This create a heavy 
overhead as all other steps in the algorithm are trivial. This 
work mainly concentrates on solving the system 

IV. Previous Work 

The area of Linear optimisation is well developed in and 
has a strong theory base. Some of the earlier works in this 
field have made the subject evolve and reach its current state. 
Theoretically the fastest known interior point algorithm to date 
is due to Vaidya (El which requires 0(((m + n)n^ + (m + 
ny-^n)L) arithmetic operations where m is the number of 
constraints, and n is the number of variables. The algorithm 
has been reported to work in 0(L) precision. Also the work 
of Mehora 0 on implementation of primal-dual methods 
are encouraging. In his work, the primal-dual method have 
been approximated by second order terms to provide faster 
convergence and use of potential functions to find direction 
and stay close to central path along iteration paved the way for 


development of central path based algorithms. Goldberg et.al 
O describes the use of interior point methods to find solution 
of bipartite matching and minimum cost related problems. It 
shows that Linear optimisation can be used to solve various 
other problems from different domains. 

Recent developments in GPU technology has allowed the 
growth of Linear programming into this domain as well. 
Spampinato and Lister ifTTIl implemented a variant of simplex 
method on NVIDIA GPUs using CUDA programming model. 
A speedup of 2x-2.5x was reported. Work of Smith et.al 
Col on Linear optimisation problems with unknown constraint 
matrix also opened up a new area of research. They have 
used accelerated sparse matrix-vector product computation to 
accelerate convergence rate. 

V. The Proposed Strategy 

The major bottleneck of the Primal-Dual Affine Scaling 
algorithm is in solving the system of equations 0- Practical 
Optimisation problems consist of thousands of constraints and 
millions of variables. As such the matrix A is quiet large and 
system ^ is even larger. So we propose the following strategy. 

Simple Row column interchange will transform system 0 
into following symmetric system 


'0 /■ 


Ax 


■ 0 ■ 

A 0 0 


Ay 

= 

0 

I 0 D 


As 


— X 


Where D = S ^Xisannxn diagonal matrix. This gives 
the components as 

Ax =D[—As] — Ix 
{ADA^)Ay =Ax 

As = — A^ Ay 

Our main concern is in solving the system {ADA^)Ay = Ax. 
A. Solving {ADA^)x = b 

These type of systems are usually solved using cholesky 
factorisation as {ADA^) is symmetric and positive definite. 
But this strategy is not optimum for architecture like CUDA 
as the decomposed matrix has a triangular shape easing to 
lots of overhead to coordinate and idle threads. More over, 
the triangular system obtained at the end gives no scope of 
parallelism at all if the system is dense. 

A better optimisation strategy involves the observation that 
only the D matrix in ADA^ changes in every iteration. Thus, 
if we can develop a strategy to split the matrix into a constant 
part and a variable part, then lots of computation can be 
done before iteration begins. Such a strategy is obtained from 
Woodbury formula. 

(A + = A“i + GDG^ 

If A = AA^, f/ = a and y = A{D-I) then (A+UV'^)-'^ = 
{ADA^)~^. When the change in matrix is considered as series 
of rank one updates, the following formula can be derived for 
directly finding solution of the system. 















Xl =Xi_i 


yi,k =yi-i,k 


vfxi^i 

xfyi-i,k 


yi-i,i 

yi-i,i 


Where Y = [^o,i|^o, 2 | • • • \yo,n] = {AA^)-^A and xq = Yb. 
Xn is the solution of the system {ADA^)x = b. This 
formulation is due to Egidi and MaponiQ. 

B. CUD A Related Optimisations 

1) Data Storage: The x vector can be appended as a column 
with the Y matrix so that the operations can be done uniformly. 
Another requirement was to store the inner products. So the 
following distribution is adopted. 


Y 

X 

Inner Products 


V matrix entries are only accessed once and hence we can 
refrain from storing it by calculating the required entry directly 
as follows: 

= Aiji^Di^i 1 ) 

2) Partitioning for thread allocation: Since there is no 
dependencies between columns, we will follow a partitioning 
strategy where each column is assigned to one block and inside 
the block, one thread takes care of only one element. When 
no of elements in column is more than maximum limit of 
threads, a ID cyclic pattern of thread to element assignment 
is followed inside a block. Figure 1 describes the assignment 
of blocks to column. 



Fig. 1. Assignment of blocks to matrix 


3) Memory access pattern: As the adjacent threads is 
assigned to adjacent elements of a column, the access pattern 
is strided by the matrix size in x direction. So to regularise 
memory access pattern, the whole matrix is stored in its trans¬ 
posed format. This transformation gives coalesced memory 
access throughout out the algorithm which is highly desirable 
for an architecture like CUDA. 


4) Inner product computation: This operation can be opti¬ 
mised for execution in CUDA by following strategies: 

• The data is first read into shared memory and reduction 
is done only from shared memory 

• Using the Add During Load technique, idle threads can 
be eliminated. Moreover, total number of threads and 
required amount of shared memory can be reduced into 
half. This allows the algorithm to work with larger data 
set. 

• Last 5 iteration can be unrolled and explicit synchroni¬ 
sation can be avoided because the threads in a warp has 
implicit synchronisation between them. 

5) Kernel Calls: Multiple kernel calls are used to produce 
inter block synchronisation. To avoid kernel call overhead, the 
concept of constant memory is used. The argument values of 
all the kernels is copied to the constant memory residing in the 
device. Kernels directly access the arguments from constant 
memory and hence passing argument from host to device 
can be eliminated. Moreover parameter reading from constant 
memory is fast operation as all threads would simultaneously 
accessing same data which can be satisfied with just one read 
inside a warp. Moreover internal caching mechanism removes 
the over head of multiple reads. Hence this strategy keeps 
kernel call overhead to a minimum 

VI. Experimental Results 

The code has been executed in Tesla cluster with following 
technical details 

• Each node of the cluster consists of Four AMD Quad- 
Core Opteron 8378 processors with 2.4Ghz clock speed. 

• Each of these compute nodes is also connected to a 
NVIDIA-Tesla S1070 GPGPU node 

• Each Tesla node is composed of 4 GPUs with each GPU 
made up of 240 processor cores. 

• 64GB Main Memory 

• Nvidia Tesla S1070 lU server with 4 GPU’s operating at 
1.296Ghz 


Serial Execution Time 



Fig. 2. Serial code execution time 




















LP Solver Execution Time 


CUDA Execution Time 



Fig. 3. Execution time of the code implemented in CUDA 



Number of variables and constraints 



Speedup Chart 



Fig. 4. This chart shows the speedup of the parallel code execution compared 
to the serial version. Highest observed speedup was 22x 



Fig. 5. Comparison of the Execution time of the whole algorithm with the 
execution time given in Jung et. al (4) . A speedup of 2x is seen from this 
comparison. 

VII. Future Work 

1) The above formulation of the problem is for dense 
systems. This can be extended towards sparse systems. 
Maintaining the same degree of coalesced memory ac¬ 
cess will be challenge there. 

2) The application of Woodbury formula proved to be 
a good alternative for solving linear system for this 
kind of problems. The derived formula proved to be an 


Fig. 6. Comparison of the Execution time between the method described in 
this work and the work done by Spampinato and Elster El on implementing 
simplex method on GPU 

Optimum CUDA friendly algorithm for solving system 
of equations. This can be pursued further to be extended 
towards sparse systems. Maintaining sparsity of the 
system will be a challenge there. 

3) We have only tested this work in CUDA architecture 
with compute capability 1.0. This work needs to be 
tested with higher capabilities devising new techniques 
to use architectural features available there. 

4) Much more work has to be done in finding an efficient 
technique to reduce kernel call overhead. The technique 
describe here in this research effort proved to be fruitful 
one. Other options exist which can be pursued further. 

5) Although this formulation is successful in reducing 
computation time by totally eliminating the need for 
memory transfer in-between iterations, the area of hybrid 
execution can be explored if asynchronous memory 
transfers can be used. 
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